
SPARK-1252. On YARN, use container-log4j.properties for executors #148

Closed · wants to merge 4 commits

Conversation

@sryza (Contributor) commented Mar 15, 2014

container-log4j.properties is a file that YARN provides so that containers can have log4j.properties distinct from that of the NodeManagers.

Logs now go to syslog; stderr and stdout contain only the process's standard error and standard output.

I tested this on pseudo-distributed clusters for both yarn (Hadoop 2.2) and yarn-alpha (Hadoop 0.23.7).
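
For context, the container-log4j.properties that Hadoop ships looks roughly like the following. This is a sketch, not the exact file - property names and defaults vary between Hadoop versions:

```
# Sketch of Hadoop's container-log4j.properties (names/defaults vary by Hadoop version).
# The root logger level is injected by the NodeManager via ${hadoop.root.logger}.
hadoop.root.logger=DEBUG,CLA
log4j.rootLogger=${hadoop.root.logger}

# ContainerLogAppender writes to the "syslog" file under the container's log directory,
# which is why executor log4j output ends up in syslog rather than stderr/stdout.
log4j.appender.CLA=org.apache.hadoop.yarn.ContainerLogAppender
log4j.appender.CLA.containerLogDir=${yarn.app.container.log.dir}
log4j.appender.CLA.totalLogFileSize=${yarn.app.container.log.filesize}
log4j.appender.CLA.layout=org.apache.log4j.PatternLayout
log4j.appender.CLA.layout.ConversionPattern=%d{ISO8601} %p [%t] %c: %m%n
```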

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13194/

@pwendell (Contributor)

Seems reasonable to me. Are you still working on this, or is it good to go?

@mridulm (Contributor) commented Mar 15, 2014

I am not sure what the intent of this PR is. The log config for workers should pretty much mirror what is in the master.

Also, hardcoding the config file, root logger levels, etc. when we already ship a logging properties file seems counterintuitive - I am probably missing something here.

@pwendell (Contributor)

@mridulm I think in YARN environments, cluster operators can set a logging file on all of the machines to be shared across applications (e.g. Spark, MapReduce, etc.). So the idea here is to point to that file.

That said, @sryza, one thing here is that even if a user bundles a log4j file, it will get ignored and replaced with the cluster default. That seems not ideal, and I'm not sure there is a good way to detect whether the user has included their own file.

@mridulm (Contributor) commented Mar 16, 2014

But that would be to debug YARN/Hadoop APIs primarily - there is no easy way to inject Spark-specific logging levels. I am curious why this was required, actually.

Currently, we have fairly fine-grained control over logging from various packages/classes by redirecting logging output to stdout/stderr - which is actually quite heavily used in user applications (mute most of Spark, enable user code; enable specific parts of Spark for debug, etc.).

Having said that, @tgravescs did the initial logging integration in YARN, so I will defer to him.

@mridulm (Contributor) commented Mar 16, 2014

To clarify, I am not saying we should not be configuring what is in container-log4j.properties - but we should be trying to do that while preserving the ability to configure (and override container-log4j.properties) via a user-specified logging config.
Since log4j appears to initialize only once per classloader, it is going to be interesting to make this happen.

@sryza (Contributor, Author) commented Mar 16, 2014

Currently, Spark doesn't ship a log4j.properties. It uses the log4j.properties that comes from Hadoop. This log4j.properties is meant for Hadoop services, not YARN containers. container-log4j.properties is a log4j configuration that Hadoop provides for containers to use.

Without this change, is there a way to control executor logging without turning on the config option that puts the user classpath first and then including a log4j.properties in the app jar? That doesn't seem like a good way to configure logging to me. With this change, users can still control logging the same way, but by instead including a file named container-log4j.properties in their app jar.

I would be in favor of providing better ways for users to configure logging for executors, but, if I understand correctly, this change isn't making things worse.

@mridulm (Contributor) commented Mar 16, 2014

There is a user-exposed option to configure log4j when running on YARN - the file is shipped as part of the job if specified.


@pwendell (Contributor)

@sryza when a user builds an application assembly jar, they are allowed to bundle their own log4j.properties file in the jar. Is this not working for you on YARN? Spark's own build intentionally ignores the Hadoop log4j.properties file so that it won't be detected:

https://github.com/apache/spark/blob/master/assembly/pom.xml#L130

If a user is building their own assembly jar, they would need to similarly exclude other log4j.properties files that might get pulled in, and put their own in src/main/resources so it gets bundled into the jar.

@mridulm (Contributor) commented Mar 16, 2014

Unless you are a Spark developer (including at Yahoo), the person building the assembly jar is not the same as the person using Spark, so depending on the assembled jar containing the logging config is not a practical solution.
Which is why, at least in the YARN case, I added the ability to specify the log4j config from the CLI - and I know this works through 0.9 (I have yet to move to trunk).

@pwendell (Contributor)

@mridulm I'm referring to the application jar, not the Spark assembly. The Spark assembly always excludes log4j.properties intentionally.

@mridulm (Contributor) commented Mar 16, 2014

Ah, OK!
But won't that be a function of how the user builds their jar? It is much easier to modify a config file than to rebuild a jar (particularly when the staging node does not have the source :-) ).

@sryza (Contributor, Author) commented Mar 16, 2014

@pwendell Spark's build excluding log4j.properties is not enough to keep it off the executor classpath. Executor classpaths include the Hadoop jars as installed locally on the cluster machines. And those include a log4j.properties.

@mridulm My bad, I hadn't noticed that there was a way to pass a log4j.properties. In that case, I think the best thing would be to use container-log4j.properties by default, but, if the client gives one, to use that instead. What do you think?

@mridulm (Contributor) commented Mar 16, 2014

Sure - or merge it with the user-provided logging config when one is provided (which could be quite tricky, so maybe not a good idea). I just want to ensure that users have the ability to customize and/or override the logging config for their specific app.


@pwendell (Contributor)

It's unfortunate that Hadoop publishes the log4j.properties file directly in its jars. This is exactly why we've avoided doing that in Spark: it creates a weird situation where you can't easily control logging preferences.

@sryza @mridulm - I think what we need is a precedence order here:

  • If a user explicitly gives a log4j.properties, use that (@mridulm - I didn't actually know we allowed this in the YARN launcher; where is the relevant code for that?)
  • If a user jar has a log4j.properties file in it, we should use that.
  • If not, we can use the YARN cluster defaults when running on YARN.

@mridulm (Contributor) commented Mar 16, 2014

Check for the use of SPARK_LOG4J_CONF in yarn/ - I think primarily in yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala. (The code has changed quite a bit since I wrote it, so I can only give pointers, unfortunately.)


@tgravescs (Contributor)

As pointed out, the SPARK_LOG4J_CONF env variable can be used to configure logging for Spark on YARN.

The one downside to making the YARN one the default is that we now get different-looking logs if the user just uses the Spark one. In the default, most things go to the syslog file, and if I just put in conf/log4j.properties by copying the template, I won't get a syslog file and most things will be in stdout, right?

@sryza (Contributor, Author) commented Mar 17, 2014

@tgravescs separating out log4j logs from stdout seems like a possible improvement to me, but if you think it will cause confusion, we can still use container-log4j.properties and use JVM props to configure it to log to stdout.

@tgravescs (Contributor)

I agree it could be an improvement. I just don't want to cause confusion when users' logs go to a different file once they turn debug on. I'm not sure I have any other good ideas for ways around it, unless we just document (or put lines in log4j.properties.template for YARN) that on YARN, to get the same behavior, you should use org.apache.hadoop.yarn.ContainerLogAppender.
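
For example, the documented or template lines might look something like the following - just a sketch, with appender settings assumed from Hadoop's container appender rather than anything this PR adds:

```
# Hypothetical YARN-oriented template lines: route Spark logs to the container "syslog"
# file via Hadoop's ContainerLogAppender instead of the console.
log4j.rootCategory=INFO, CLA
log4j.appender.CLA=org.apache.hadoop.yarn.ContainerLogAppender
log4j.appender.CLA.containerLogDir=${yarn.app.container.log.dir}
log4j.appender.CLA.layout=org.apache.log4j.PatternLayout
log4j.appender.CLA.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```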

@pwendell thoughts?

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

One or more automated tests failed
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13490/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@sryza (Contributor, Author) commented Mar 27, 2014

I uploaded a patch with a new approach. In it, we use neither Hadoop's log4j.properties nor Hadoop's container-log4j.properties. I added a log4j-spark-container.properties file (with the same conf as conf/log4j.properties.template) that gets included in the Spark YARN jar. This is used by default, but if the user specifies SPARK_LOG4J_CONF, that will get used instead.

I think this should address the issues that y'all have pointed out with the original approach, but let me know if I'm missing anything.
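
As an illustration of the override path, a user could point SPARK_LOG4J_CONF at a custom file like the one below. This is only a sketch - the package names are placeholders - but it is the kind of fine-grained control @mridulm described earlier:

```
# Sketch of a custom log4j.properties shipped to containers when SPARK_LOG4J_CONF points at it.
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Quiet Spark internals, keep application packages verbose (placeholder package name).
log4j.logger.org.apache.spark=WARN
log4j.logger.com.example.myapp=DEBUG
```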

@AmplabJenkins

Can one of the admins verify this patch?

@pwendell (Contributor) commented Apr 4, 2014

@tgravescs mind taking a look at this?

log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
Contributor

Do we need to have two different log4j properties files, or can they go in common?

@sryza (Contributor, Author)

I tried putting them in common initially and they weren't showing up in the YARN jar. I just tried mucking around with Maven a bit and was able to get this to work, so I'm uploading a new patch. I'm not sure how to do the same in sbt though.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13773/

@mridulm (Contributor) commented Apr 5, 2014

Just to be clear, the current status of the patch seems to be:
a) If the user specified a logging config, use that.
b) If missing, use a Spark config built into the jar.

(b) seems different from the original intent of the PR, but I guess if we can't merge the container logging config with ours, it can't be helped.
What about simply copying the existing Hadoop log4j container config into (b) and expanding the variables, as was done in the first version of the PR, when (a) is missing? (It would have the nice property of logs going to syslog in that case, no? Or won't that work?)
Also, is this distinct from what happens in the master, or does it apply there too?

@sryza (Contributor, Author) commented Apr 7, 2014

Right, (b) is different from the original intent of the PR. The reason for using Spark's log4j config instead of Hadoop's was the concern @tgravescs brought up earlier:

The one downside to making the YARN one the default is that we now get different-looking logs if the user just uses the Spark one. In the default, most things go to the syslog file, and if I just put in conf/log4j.properties by copying the template, I won't get a syslog file and most things will be in stdout, right?

This applies to the master too.

@tgravescs (Contributor)

Changes look good. Sandy, can you merge up to the latest master and I'll commit?

@tgravescs (Contributor)

Note the other option would be to change conf/log4j.properties.template to be more like Hadoop's.

@sryza (Contributor, Author) commented Apr 7, 2014

Thanks Tom. I just rebased on master.

Note the other option would be to change conf/log4j.properties.template to be more like Hadoop's.

I don't have an opinion on this, but happy to make the change if you think it's the right thing.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13847/

@asfgit closed this in 9dd8b91 on Apr 7, 2014
jhartlaub referenced this pull request in jhartlaub/spark May 27, 2014
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014

Author: Sandy Ryza <[email protected]>

Closes apache#148 from sryza/sandy-spark-1252 and squashes the following commits:

c0043b8 [Sandy Ryza] Put log4j.properties file under common
55823da [Sandy Ryza] Add license headers to new files
10934b8 [Sandy Ryza] Add log4j-spark-container.properties and support SPARK_LOG4J_CONF
e74450b [Sandy Ryza] SPARK-1252. On YARN, use container-log4j.properties for executors